Analysis of Crime and Temperature Data

Introduction

This report examines the relationship between patterns of crime and trends in temperature, on the premise that the findings may provide meaningful insights into life-saving interventions and public safety strategies. The research proceeds in three distinct steps: 1) clean and prepare the crime and temperature data for analysis, 2) examine each dataset independently to identify trends and patterns, and 3) combine the crime and temperature datasets to investigate whether one variable affects the other. Using static and interactive visualizations accompanied by detailed statistical analyses, this report aims to produce findings that can support data-driven decision-making for public safety, law enforcement operations, and urban planning.

1. Crime Data Analysis

1.1 Crime Data Preparation & Cleaning

# Load required libraries
library(readr)
library(dplyr)
library(lubridate)
# Load the crime dataset
crime_cases <- read_csv("crime24.csv")
# Print column names
colnames(crime_cases)
##  [1] "...1"             "category"         "persistent_id"    "date"            
##  [5] "lat"              "long"             "street_id"        "street_name"     
##  [9] "context"          "id"               "location_type"    "location_subtype"
## [13] "outcome_status"
# Show summary statistics
summary(crime_cases)
##       ...1        category         persistent_id          date          
##  Min.   :   1   Length:6304        Length:6304        Length:6304       
##  1st Qu.:1577   Class :character   Class :character   Class :character  
##  Median :3152   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3152                                                           
##  3rd Qu.:4728                                                           
##  Max.   :6304                                                           
##       lat             long          street_id       street_name       
##  Min.   :51.88   Min.   :0.8788   Min.   :2152686   Length:6304       
##  1st Qu.:51.89   1st Qu.:0.8966   1st Qu.:2153025   Class :character  
##  Median :51.89   Median :0.9013   Median :2153155   Mode  :character  
##  Mean   :51.89   Mean   :0.9029   Mean   :2153873                     
##  3rd Qu.:51.89   3rd Qu.:0.9088   3rd Qu.:2153366                     
##  Max.   :51.90   Max.   :0.9246   Max.   :2343256                     
##  context              id            location_type      location_subtype  
##  Mode:logical   Min.   :115954844   Length:6304        Length:6304       
##  NA's:6304      1st Qu.:118009952   Class :character   Class :character  
##                 Median :120228058   Mode  :character   Mode  :character  
##                 Mean   :120403000                                        
##                 3rd Qu.:122339060                                        
##                 Max.   :125550731                                        
##  outcome_status    
##  Length:6304       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Check total missing values per column
colSums(is.na(crime_cases))
##             ...1         category    persistent_id             date 
##                0                0              732                0 
##              lat             long        street_id      street_name 
##                0                0                0                0 
##          context               id    location_type location_subtype 
##             6304                0                0             6282 
##   outcome_status 
##              710

Before analysis could begin, the raw datasets had to be cleaned and prepared. The crime data recorded the type, date, and location of each crime; the temperature data consisted of daily meteorological readings across 2024. Cleaning followed a number of steps to make the data accurate, consistent, and ready for analysis.

The first step was to deal with missing and null values. Records missing essential information, such as a timestamp or a temperature reading, were either discarded or imputed from neighbouring values. Duplicate records, which occur frequently in crime reports, were also identified and removed; they add no value and could distort the results. Date fields were then reformatted to allow accurate merging and matching: all dates were converted to the format YYYY-MM-DD so that nothing would be mismatched during the merge.
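The de-duplication and date-standardisation steps described above can be sketched as follows (a minimal illustration using the columns of the crime dataset loaded earlier; the imputation rule itself is handled separately below):

```r
library(dplyr)

# Drop exact duplicate crime records: they add no information
# and would inflate counts
crime_cases <- distinct(crime_cases)

# Standardise the month-level date strings ("2024-01") to full
# Dates by appending the first day of the month
crime_cases$date <- as.Date(paste0(crime_cases$date, "-01"))
```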

After cleaning, a table was produced comparing the number of rows before and after cleaning, along with the null values remaining in each column. The table demonstrated the substantial amount of noise removed from the dataset, improving its quality for subsequent analysis. The two datasets were then merged by aligning the crime data with the temperature data on their common date field, making it possible to study how variations in temperature relate to criminal events.
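Because the crime records are monthly while the temperature readings are daily, the merge described above requires aggregating one side first. A possible sketch (the full merge is carried out in the combined analysis):

```r
library(dplyr)
library(lubridate)

# Aggregate daily temperatures to monthly means
monthly_temp <- temp_records %>%
  mutate(month = floor_date(Date, "month")) %>%
  group_by(month) %>%
  summarise(avg_temp = mean(TemperatureCAvg, na.rm = TRUE))

# Join monthly crime counts onto the monthly climate summary
crime_temp <- crime_cases %>%
  count(month = floor_date(date, "month"), name = "total_crimes") %>%
  left_join(monthly_temp, by = "month")
```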

# Convert the 'date' column to proper Date format
# The format is assumed to be "YYYY-MM", so we append "-01" to convert to full date
crime_cases$date <- as.Date(paste0(crime_cases$date, "-01"), format = "%Y-%m-%d")

# Remove column named '...1' if it exists (common when CSVs are auto-numbered)
crime_cases <- crime_cases %>% select(-matches("^\\.\\.\\.1$"))

# ----- Remove columns with more than 90% missing values -----

# Calculate the percentage of missing values per column
missing_percentage <- colMeans(is.na(crime_cases))

# Filter and keep only columns with <= 90% missing values
crime_cases <- crime_cases[, missing_percentage <= 0.9]

# Recheck missing values
colSums(is.na(crime_cases))
##       category  persistent_id           date            lat           long 
##              0            732              0              0              0 
##      street_id    street_name             id  location_type outcome_status 
##              0              0              0              0            710
# Define a function to get the mode
most_common_value <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace NA values in character columns with mode
crime_cases <- crime_cases %>%
  mutate(across(where(is.character), ~ ifelse(is.na(.), most_common_value(.), .)))

# Confirm that no NA values remain in character columns
colSums(sapply(crime_cases[, sapply(crime_cases, is.character)], is.na))
##       category  persistent_id    street_name  location_type outcome_status 
##              0              0              0              0              0

1.2 Plotting the Distribution of Crime Categories

library(ggplot2)
library(dplyr)
# Count crimes by category
crime_count_summary <- crime_cases %>%
  count(category, sort = TRUE)

# Static bar plot
ggplot(crime_count_summary, aes(x = reorder(category, n), y = n)) +
  geom_bar(stat = "identity", fill = "#2C7BB6") +
  coord_flip() +
  labs(title = "Distribution of Crime Categories",
       x = "Crime Category", y = "Count") +
  theme_minimal()

The above visualization shows how the dataset of 6,304 observations breaks down by crime category. The most frequent crime was “anti-social behaviour”, with over 1,200 incidents, followed by “violent crime”, “criminal damage and arson”, and “public order”. The least frequent categories were “robbery”, “theft from person”, and “vehicle crime”. This bar plot provides a high-level overview of crime and an indication of where community and police resources might be best spent.
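The category shares behind these observations can be computed directly from the count summary built above (a small sketch):

```r
library(dplyr)

# Percentage share of each crime category in the dataset
crime_count_summary %>%
  mutate(share = round(100 * n / sum(n), 1))
```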

1.3 Plotting the Outcome Status of the Crimes

# Count outcome status
outcome_count_summary <- crime_cases %>%
  count(outcome_status, sort = TRUE)

ggplot(outcome_count_summary, aes(x = reorder(outcome_status, n), y = n)) +
  geom_bar(stat = "identity", fill = "#D95F02") +
  coord_flip() +
  labs(title = "Outcome Status of Crimes",
       x = "Outcome", y = "Count") +
  theme_minimal()

This horizontal bar chart illustrates how each crime was resolved or eventually closed. The most common status was “Investigation complete; no suspect identified”, followed by “Under investigation”. It is also worth noting that 710 of the 6,304 records had missing outcome statuses, likely reflecting either a systemic reporting issue or incompleteness in the source records. Recognizing where bottlenecks exist in the criminal justice system is invaluable when evaluating how effective crime resolution processes are.

1.4 Plotting Crime by the Location Type

# Count location types
location_count_summary <- crime_cases %>%
  count(location_type, sort = TRUE)

ggplot(location_count_summary, aes(x = reorder(location_type, n), y = n)) +
  geom_bar(stat = "identity", fill = "#7570B3") +
  coord_flip() +
  labs(title = "Crime by Location Type",
       x = "Location Type", y = "Number of Crimes") +
  theme_minimal()

This bar graph classifies crimes according to the type of place. “On or near Street” was by far the most frequent location type, followed by shopping centres, then residential areas and parking areas. For urban planning and police resource allocation, especially for directing attention to places with high incident volumes over time, this analysis provides significant insight.

1.5 Plotting a Two-Way Table for Category and Location Type

library(ggplot2)
library(dplyr)
# Count category/location combinations and plot them as a heatmap-like tile plot
category_location_counts <- crime_cases %>%
  count(location_type, category)

ggplot(category_location_counts, aes(x = location_type, y = category, fill = n)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "steelblue") +
  labs(title = "Two-Way Table: Category vs Location Type",
       x = "Location Type", y = "Crime Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This heatmap combines crime category and location type to show where each type of crime is most likely to occur. For example, “Shoplifting” occurs mainly in commercial and retail locations, whereas “Anti-social behaviour” is spread more evenly across public spaces. The colour intensity of each tile helps stakeholders identify which crime types pose risks in specific environments.
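The underlying two-way table itself can be inspected numerically with base R, as a companion to the tile plot:

```r
# Cross-tabulate crime category against location type
two_way <- table(crime_cases$category, crime_cases$location_type)
two_way

# Row proportions show, for each category, how its incidents
# split across location types
round(prop.table(two_way, margin = 1), 2)
```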

1.6 Plotting the Top Crime Streets

# Count top streets
top_crime_streets <- crime_cases %>%
  count(street_name, sort = TRUE) %>%
  slice_max(n, n = 10)

ggplot(top_crime_streets, aes(x = reorder(street_name, n), y = n, fill = n)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightpink", high = "navy") +
  coord_flip() +
  labs(title = "Top 10 Crime-Prone Streets",
       x = "Street", y = "Number of Crimes") +
  theme_minimal()

This section analyses the 10 streets in Colchester with the most reported crimes. The bar chart, sorted by number of incidents, shows that High Street (112), East Hill (96), and Queen Street (83) were the top three, followed by Magdalen Street (77), St John’s Street (65), North Station Road (61), Military Road (56), St Botolph’s Street (55), Priory Street (52), and Osborne Street (49). This plot matters for community policing and targeted interventions, because these likely crime hotspots may benefit from greater surveillance, public safety programmes, and lighting improvements.

1.7 Converting the Crime Data Date into a Date Format

# The 'date' column was already converted to Date format earlier; verify:
str(crime_cases$date)
##  Date[1:6304], format: "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" "2024-01-01" ...

For any time-based grouping or merging with the weather dataset, the original date field in the crime data had to be converted from a string of the form “2024-01” to a proper R Date (YYYY-MM-DD), using the as.Date() function. Once the date values were formatted, temporal functions could be used for monthly grouping, seasonal grouping, and correlation with temperature readings. Without this conversion from the character type to Date, visualizing trends or conducting time-series analysis would have been unreliable, if not impossible.

1.8 Calculating the Total Number of Crimes on a Monthly Basis

library(dplyr)
library(lubridate)
library(gt)
# Prepare data
crime_by_month <- crime_cases %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(total_crimes = n(), .groups = "drop")

# Create styled table
crime_by_month %>%
  head(12) %>%
  gt() %>%
  tab_header(
    title = "Monthly Crime Summary"
  ) %>%
  cols_label(
    month = "Month",
    total_crimes = "Total Crimes"
  ) %>%
  fmt_number(
    columns = total_crimes,
    decimals = 0,
    sep_mark = ","
  ) %>%
  tab_options(
    table.border.top.color = "black",
    table.border.bottom.color = "black",
    table.border.top.width = px(2),
    table.border.bottom.width = px(2),
    heading.title.font.size = 16,
    heading.title.font.weight = "bold"
  )
Monthly Crime Summary
Month Total Crimes
2024-01-01 529
2024-02-01 546
2024-03-01 502
2024-04-01 471
2024-05-01 568
2024-06-01 490
2024-07-01 608
2024-08-01 533
2024-09-01 519
2024-10-01 537
2024-11-01 509
2024-12-01 492

The date column was formatted, and crime counts were aggregated monthly using R’s group_by() and summarise() functions. This produced a summary table of how many crimes occurred in each month of 2024. A few notable values:

  1. January 2024: 529 crimes

  2. February 2024: 546 crimes

  3. March 2024: 502 crimes

  4. April 2024: 471 crimes

1.9 Monthly Crime Count Over Time

ggplot(crime_by_month, aes(x = month, y = total_crimes)) +
  geom_point(color = "red", size = 3) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "darkorange", linetype = "dashed") +
  labs(title = "Monthly Crime Trend (with Smoothing)",
       x = "Month", y = "Total Crimes") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

A line graph was used to present the monthly totals visually, with the months of 2024 on the x-axis and the count of recorded crimes on the y-axis. It was immediately apparent that July and May were the peak months, with counts of 608 and 568, while April (471), June (490), and December (492) had the lowest counts, each under 500 incidents. A time-series line graph of this kind is extremely useful for showing cyclical or seasonal patterns, which helps law enforcement agencies forecast and plan for anticipated high volumes in advance.
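The peak and trough months described above can also be extracted programmatically from crime_by_month (a brief sketch):

```r
library(dplyr)

# Months with the highest and lowest crime counts
slice_max(crime_by_month, total_crimes, n = 1)
slice_min(crime_by_month, total_crimes, n = 1)
```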

1.10 Encoding Categorical Variables into Dummy Variables in Crime Dataset

# Load the caret package if needed
library(caret)
# One-hot encode 'category' and 'location_type'
crime_detected <- crime_cases %>%
  select(category, location_type)

crime_transformed <- dummyVars(" ~ .", data = crime_detected) %>%
  predict(newdata = crime_detected) %>%
  as.data.frame()

1.11 Computing the Correlation Matrix for All the Crime Categories

library(reshape2)
# Compute and plot the correlation matrix
correlation_values <- cor(crime_transformed)
ggplot(melt(correlation_values), aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 2) +
  theme_minimal() +
  labs(title = "Correlation Matrix of Crime Categories",
       x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The correlation matrix was rendered as a heatmap with a coefficient in each cell, values ranging between -1 and +1. A value near +1 indicates a strong positive correlation, here meaning two crime types that tend to co-occur. For example, “robbery” and “theft from person” showed a moderate positive correlation, suggesting that the environmental factors under which both crime types occur are roughly similar. Pairs such as “vehicle crime” and “anti-social behaviour” showed little to no correlation, indicating they are more independent in nature. Overall, the correlation matrix provided valuable insight into points of co-occurrence, hinting at shared root causes or environmental triggers behind the high-frequency crime categories.
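To read off the strongest pairings without scanning the heatmap, the correlation_values matrix computed above can be flattened and sorted (a sketch):

```r
library(reshape2)

# Flatten the matrix, drop self-correlations and duplicate pairs,
# then sort by absolute correlation strength
cor_long <- melt(correlation_values)
cor_long <- subset(cor_long, as.character(Var1) < as.character(Var2))
head(cor_long[order(-abs(cor_long$value)), ], 10)
```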

1.12 Scatter Plot Between the Longitude and Latitude

ggplot(crime_cases, aes(x = long, y = lat)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  labs(title = "Scatter Plot of Crime Locations",
       x = "Longitude", y = "Latitude") +
  theme_minimal()

A geographic analysis of crime was performed by plotting the longitude and latitude coordinates of each incident as a scatter plot. The map-like illustration of more than 6,300 crime points reveals the spatial distribution of crime across Colchester. The most significant clustering occurs in central Colchester, with particularly dense pockets around streets such as High Street, Queen Street, and the East Hill area. These pockets of density mark the urban and commercial centre, which sees more activity and consequently more crime than the rest of the town. This type of study is very useful for city planners and law agencies when identifying high-priority areas for infrastructure improvements or additional patrols.
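The visual clustering can be quantified by binning the coordinates, so that hotspot density becomes a count rather than an impression (a sketch; the bin count of 40 is an arbitrary choice):

```r
library(ggplot2)

# 2D histogram of crime coordinates: darker bins mark denser areas
ggplot(crime_cases, aes(x = long, y = lat)) +
  geom_bin2d(bins = 40) +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Crime Density by Coordinate Bin",
       x = "Longitude", y = "Latitude") +
  theme_minimal()
```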

1.13 Plotting a Leaflet / Map for Colchester

library(leaflet)
leaflet(crime_cases) %>%
  addTiles() %>%
  addCircleMarkers(~long, ~lat,
                   radius = 1,
                   color = "red",
                   popup = ~category)

The digital map of Colchester offers clickable points for each crime, identifying crime type, location and, at times, outcome status. Users can zoom in and out and filter by location type, making it possible to explore patterns far more interactively than with static graphs. Tools like these are very useful for real-time dashboards, public transparency initiatives, and community policing strategies: a local resident could examine recent incidents around their neighbourhood, or a policymaker could identify higher-risk blocks for urban upgrades.
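The filtering behaviour mentioned above can be sketched with leaflet’s layer groups, one toggleable group per location type (an illustrative extension of the map code, not part of the original output):

```r
library(leaflet)

# Build the map with one overlay group per location type
map <- leaflet() %>% addTiles()
for (lt in unique(crime_cases$location_type)) {
  map <- map %>%
    addCircleMarkers(data = subset(crime_cases, location_type == lt),
                     ~long, ~lat, radius = 1, color = "red",
                     popup = ~category, group = lt)
}
# Checkbox control to show/hide each location type
map %>% addLayersControl(overlayGroups = unique(crime_cases$location_type))
```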

1.14 Crime Count Across Different Seasons

library(ggplot2)
library(dplyr)
# Assign seasons to months
crime_cases$month <- month(crime_cases$date)
crime_cases$season <- case_when(
  crime_cases$month %in% c(12, 1, 2) ~ "Winter",
  crime_cases$month %in% c(3, 4, 5) ~ "Spring",
  crime_cases$month %in% c(6, 7, 8) ~ "Summer",
  TRUE ~ "Autumn"
)

# Crime count per season
crime_by_season <- crime_cases %>%
  group_by(season) %>%
  summarise(total_crimes = n())

# Plot pie chart with total number of crimes per season
ggplot(crime_by_season, aes(x = "", y = total_crimes, fill = season)) +
  geom_bar(stat = "identity", width = 1, colour = "white", linewidth = 1, show.legend = TRUE) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = total_crimes), 
            position = position_stack(vjust = 0.5), color = "white", size = 3.5) +
  labs(title = "Total Crimes Across Seasons") +
  theme_minimal() +
  theme_void() + 
  theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

This analysis aims to determine whether crime in Colchester is seasonal. The data were assigned to meteorological seasons: Winter (Dec–Feb), Spring (Mar–May), Summer (Jun–Aug), and Autumn (Sep–Nov). A pie chart illustrates the proportion of total crimes committed in each season, presenting the breakdown simply and informatively.

According to the results:

  1. Summer recorded the highest crime count, with 1,636 incidents, accounting for approximately 26% of all reported crimes.

  2. Spring followed closely with 1,607 crimes.

  3. Autumn accounted for 1,570 crimes, while

  4. Winter had the lowest with 1,491 incidents.

The seasonal differences indicate higher levels of crime during the warmer months (Spring and Summer), perhaps because more people are outside, public gatherings are more frequent, and daylight hours are longer. These findings reinforce the general observation in criminology that certain types of crime, particularly those involving contact with others or public opportunity (theft, anti-social behaviour), rise in warmer weather.
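The seasonal percentages quoted above follow directly from the crime_by_season summary built earlier (a short check):

```r
library(dplyr)

# Share of annual crime falling in each season
crime_by_season %>%
  mutate(share = round(100 * total_crimes / sum(total_crimes), 1))
```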

1.15 Variation in Crime Counts Across Different Seasons

# Group by season and category
seasonal_crime_stats <- crime_cases %>%
  group_by(season, category) %>%
  summarise(crime_count = n(), .groups = "drop")

# Heatmap by season and crime type
ggplot(seasonal_crime_stats, aes(x = season, y = category, fill = crime_count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Crime Category Variation Across Seasons",
       x = "Season", y = "Crime Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

A heatmap was used to display crime categories against seasons, with colour intensity based on the number of incidents. This format gives an at-a-glance view of which crime types peak in which season.

Some key observations from this heatmap include:

  1. Anti-social behaviour showed a sharp increase in Summer, reaching its peak during the warmer months.

  2. Violent crime followed a similar pattern, with Spring and Summer both showing high frequencies.

  3. Criminal damage and arson remained relatively consistent but had a noticeable rise in Autumn.

  4. Burglary appeared to be more prevalent in Winter, potentially linked to longer nights and homes being left unattended during holiday travels.

The benefit of such a breakdown is its strategic value. Police services, for example, might strengthen patrols against anti-social behaviour in parks and public squares during the Summer, and prepare for the seasonal rise in burglaries around the Winter holidays. The breakdown would also feed into predictive models and help time public awareness campaigns around seasonal risks.

2. Temperature Data Analysis

2.1 Loading and Cleaning the Temperature Dataset

# Load necessary packages
library(readr)
library(dplyr)
library(lubridate)
# Load the temperature dataset
temp_records <- read_csv("temp24.csv")
## Rows: 366 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): WindkmhDir
## dbl  (15): station_ID, TemperatureCAvg, TemperatureCMax, TemperatureCMin, Td...
## lgl   (1): PreselevHp
## date  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure and summary
summary(temp_records)
##    station_ID        Date            TemperatureCAvg TemperatureCMax
##  Min.   :3590   Min.   :2024-01-01   Min.   :-2.60   Min.   : 1.10  
##  1st Qu.:3590   1st Qu.:2024-04-01   1st Qu.: 7.00   1st Qu.:10.72  
##  Median :3590   Median :2024-07-01   Median :10.95   Median :14.75  
##  Mean   :3590   Mean   :2024-07-01   Mean   :10.98   Mean   :15.08  
##  3rd Qu.:3590   3rd Qu.:2024-09-30   3rd Qu.:14.50   3rd Qu.:19.60  
##  Max.   :3590   Max.   :2024-12-31   Max.   :23.10   Max.   :29.80  
##                                                                     
##  TemperatureCMin      TdAvgC           HrAvg        WindkmhDir       
##  Min.   :-6.100   Min.   :-6.000   Min.   :59.60   Length:366        
##  1st Qu.: 3.325   1st Qu.: 4.725   1st Qu.:75.90   Class :character  
##  Median : 6.800   Median : 8.200   Median :82.75   Mode  :character  
##  Mean   : 6.486   Mean   : 7.752   Mean   :81.74                     
##  3rd Qu.: 9.500   3rd Qu.:11.000   3rd Qu.:88.80                     
##  Max.   :16.700   Max.   :16.900   Max.   :98.60                     
##                                                                      
##    WindkmhInt     WindkmhGust       PresslevHp         Precmm      
##  Min.   : 3.90   Min.   : 11.10   Min.   : 978.9   Min.   : 0.000  
##  1st Qu.:12.22   1st Qu.: 31.50   1st Qu.:1007.5   1st Qu.: 0.000  
##  Median :15.80   Median : 38.90   Median :1013.8   Median : 0.200  
##  Mean   :16.52   Mean   : 40.81   Mean   :1013.7   Mean   : 1.864  
##  3rd Qu.:19.80   3rd Qu.: 48.20   3rd Qu.:1021.0   3rd Qu.: 1.600  
##  Max.   :42.50   Max.   :105.60   Max.   :1037.3   Max.   :38.000  
##                                                    NA's   :24      
##     TotClOct        lowClOct         SunD1h           VisKm      
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.10  
##  1st Qu.:3.800   1st Qu.:5.800   1st Qu.: 0.325   1st Qu.:20.73  
##  Median :5.600   Median :6.900   Median : 3.500   Median :30.95  
##  Mean   :5.304   Mean   :6.609   Mean   : 4.203   Mean   :31.42  
##  3rd Qu.:7.200   3rd Qu.:7.600   3rd Qu.: 7.100   3rd Qu.:41.20  
##  Max.   :8.000   Max.   :8.000   Max.   :15.600   Max.   :71.20  
##                  NA's   :5                                       
##    SnowDepcm    PreselevHp    
##  Min.   :1.00   Mode:logical  
##  1st Qu.:1.25   NA's:366      
##  Median :1.50                 
##  Mean   :1.50                 
##  3rd Qu.:1.75                 
##  Max.   :2.00                 
##  NA's   :364
# Convert 'Date' column to proper Date format
temp_records$Date <- as.Date(temp_records$Date, format = "%Y-%m-%d")

# Print column names and check for missing values
colnames(temp_records)
##  [1] "station_ID"      "Date"            "TemperatureCAvg" "TemperatureCMax"
##  [5] "TemperatureCMin" "TdAvgC"          "HrAvg"           "WindkmhDir"     
##  [9] "WindkmhInt"      "WindkmhGust"     "PresslevHp"      "Precmm"         
## [13] "TotClOct"        "lowClOct"        "SunD1h"          "VisKm"          
## [17] "SnowDepcm"       "PreselevHp"
colSums(is.na(temp_records))
##      station_ID            Date TemperatureCAvg TemperatureCMax TemperatureCMin 
##               0               0               0               0               0 
##          TdAvgC           HrAvg      WindkmhDir      WindkmhInt     WindkmhGust 
##               0               0               0               0               0 
##      PresslevHp          Precmm        TotClOct        lowClOct          SunD1h 
##               0              24               0               5               0 
##           VisKm       SnowDepcm      PreselevHp 
##               0             364             366
# Remove columns with more than 90% missing values
temp_records <- temp_records %>% select(where(~ mean(is.na(.)) <= 0.9))

# View cleaned column names
colnames(temp_records)
##  [1] "station_ID"      "Date"            "TemperatureCAvg" "TemperatureCMax"
##  [5] "TemperatureCMin" "TdAvgC"          "HrAvg"           "WindkmhDir"     
##  [9] "WindkmhInt"      "WindkmhGust"     "PresslevHp"      "Precmm"         
## [13] "TotClOct"        "lowClOct"        "SunD1h"          "VisKm"

The 2024 temperature dataset for Colchester comprised 366 records (including the leap day) and 18 columns of daily meteorological data: average, maximum, and minimum temperature; relative humidity; precipitation; wind speed; sunshine hours; visibility; and so on. The first cleaning step was to drop columns with more than 90% missing values (for example, SnowDepcm and PreselevHp), which were effectively useless as they contained almost no data. The remaining columns, which had little to no missing data, were kept for analysis. This stage was important because it helped guarantee that the dataset had the quality and reliability necessary for subsequent summaries and visuals.

2.2 Replace Missing Values in Categorical Columns with Mode

# Define mode function
most_common_value <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace NAs in character columns with mode
temp_records <- temp_records %>%
  mutate(across(where(is.character), ~ ifelse(is.na(.), most_common_value(.), .)))

2.3 Monthly Summary of Weather Data

# Extract month name for grouping
temp_records$Month <- month(temp_records$Date, label = TRUE, abbr = FALSE)

# Summarise by month
weather_by_month <- temp_records %>%
  group_by(Month) %>%
  summarise(
    AvgTemp = mean(TemperatureCAvg, na.rm = TRUE),
    MaxTemp = mean(TemperatureCMax, na.rm = TRUE),
    MinTemp = mean(TemperatureCMin, na.rm = TRUE),
    Humidity = mean(HrAvg, na.rm = TRUE),
    Precipitation = mean(Precmm, na.rm = TRUE),
    WindSpeed = mean(WindkmhInt, na.rm = TRUE),
    Sunshine = mean(SunD1h, na.rm = TRUE),
    Visibility = mean(VisKm, na.rm = TRUE),
    .groups = 'drop'
  )

Once cleaned, the dataset was grouped by month to calculate average values for key weather indicators. This monthly aggregation revealed trends such as:

  1. July had the highest average temperature (~17.9°C) and maximum sunshine hours.

  2. January recorded the lowest average temperature (~5.3°C).

  3. October and November saw the highest rainfall.

Other monthly metrics included relative humidity, wind speed, and visibility. This summary is the basis for all monthly comparisons and establishes a climatic profile for each month, to be associated later with the crime data.
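The monthly extremes listed above can be pulled straight from the weather_by_month summary (a brief verification sketch):

```r
library(dplyr)

# Warmest and coldest months by average temperature
slice_max(weather_by_month, AvgTemp, n = 1)
slice_min(weather_by_month, AvgTemp, n = 1)

# Two wettest months by mean daily precipitation
slice_max(weather_by_month, Precipitation, n = 2)
```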

2.5 Distributions of Humidity, Wind Speed, and Visibility by Month

library(ggridges)
ggplot(temp_records, aes(x = HrAvg, y = Month, fill = Month)) +
  geom_density_ridges(scale = 1, alpha = 0.7, color = "white") +
  labs(title = "Humidity Distribution per Month",
       x = "Relative Humidity (%)", y = "Month") +
  theme_minimal() +
  theme(legend.position = "none")
## Picking joint bandwidth of 2.84

Each graphic represents distributions across the twelve months. Humidity (HrAvg) had a tight, consistent distribution in all months, while wind speed (WindkmhInt) and visibility (VisKm) displayed much more variability, particularly during the transitional months of April and October. These visualizations help identify the months with the most volatile atmospheric conditions, and hint at potential changes in behaviour or movement.
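The wind-speed and visibility distributions discussed above can be drawn with the same ridgeline approach as the humidity plot; only the x variable and labels change (a sketch):

```r
library(ggridges)
library(ggplot2)

# Wind speed distribution per month
ggplot(temp_records, aes(x = WindkmhInt, y = Month, fill = Month)) +
  geom_density_ridges(scale = 1, alpha = 0.7, color = "white") +
  labs(title = "Wind Speed Distribution per Month",
       x = "Wind Speed (km/h)", y = "Month") +
  theme_minimal() +
  theme(legend.position = "none")

# Visibility distribution per month
ggplot(temp_records, aes(x = VisKm, y = Month, fill = Month)) +
  geom_density_ridges(scale = 1, alpha = 0.7, color = "white") +
  labs(title = "Visibility Distribution per Month",
       x = "Visibility (km)", y = "Month") +
  theme_minimal() +
  theme(legend.position = "none")
```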

2.6 Correlation Between Weather Variables

library(reshape2)
# Select numeric weather variables
weather_correlation_data <- temp_records %>%
  select(TemperatureCAvg, TemperatureCMax, TemperatureCMin,
         HrAvg, Precmm, WindkmhInt, VisKm, SunD1h)

# Compute and plot correlation matrix
correlation_values <- cor(weather_correlation_data, use = "complete.obs")

# Plot correlation heatmap with correlation values on the tiles
ggplot(melt(correlation_values), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Between Weather Variables", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

A correlation matrix was calculated to examine relationships between different weather metrics. Strong positive correlations were found between:

  1. Temperature variables (average, max, min)

  2. Sunshine and temperature

Negative correlations appeared between:

  1. Humidity and sunshine

  2. Precipitation and visibility

This is useful for understanding broader climate dynamics. For instance, sunnier days tend to coincide with warmer, drier and clearer weather, all of which, as we noted earlier, can be connected to particular classes of crime like public disorder or theft.
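As a small, hypothetical follow-up (toy numbers, not the report's data), the strongest off-diagonal relationship in a correlation matrix like the one plotted above can be pulled out programmatically rather than read off the heatmap:

```r
# Build a toy correlation matrix and locate its strongest off-diagonal pair
m <- cor(data.frame(
  temp = c(10, 12, 15, 18, 20),
  sun  = c(1, 2, 4, 6, 7),
  hum  = c(90, 85, 80, 70, 65)
))
diag(m) <- NA  # ignore the trivial self-correlations
idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)[1, ]
strongest_pair <- rownames(m)[idx]  # here: sunshine and temperature
```

On the real weather matrix, the same idiom would confirm which of the visually strong pairs (e.g. sunshine and temperature) dominates numerically.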

2.7 Temperature Range Pie Chart

library(plotly)
# Calculate temperature range per month
weather_by_month$temp_range <- weather_by_month$MaxTemp - weather_by_month$MinTemp

# Create frequency table
temperature_frequency <- table(cut(weather_by_month$temp_range, breaks = 5))
temp_analysis_df <- data.frame(Temperature_Range = names(temperature_frequency),
                      Frequency = as.vector(temperature_frequency))

# Create pie chart
plot_ly(data = temp_analysis_df, labels = ~Temperature_Range, values = ~Frequency, type = "pie") %>%
  layout(title = "Temperature Range Distribution")

To show the distribution of monthly temperature ranges, a pie chart was constructed. The range was calculated by subtracting the monthly minimum from the maximum temperature and grouped into 5 bins:

  1. (5.64°C–7.01°C): 5 months

  2. (7.01°C–8.37°C): 1 month

  3. (8.37°C–9.73°C): 1 month

  4. (9.73°C–11.1°C): 3 months

  5. (11.1°C–12.5°C): 2 months

The chart shows that most months had only moderate daily temperature differences, with a few displaying wide swings (greater than 10°C). Temperature range is significant for understanding when weather variability may affect people’s behavior or the level of energy demand.
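For clarity, the binning step relies on base R's cut() with an integer breaks argument, which splits the range of the input into equal-width intervals (extended slightly at the ends). A toy sketch with invented range values:

```r
# cut(x, breaks = 5) creates five equal-width bins over the range of x,
# which is how the five labelled intervals in the pie chart arise
x <- c(5.7, 6.0, 7.5, 9.0, 10.5, 12.4)
range_bins <- cut(x, breaks = 5)
bin_counts <- table(range_bins)  # frequency of values per bin
```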

3. Combining Crime and Weather Data for Analysis

# Ensure date format matches in both datasets
crime_cases$date <- as.Date(crime_cases$date)
temp_records$Date <- as.Date(temp_records$Date)

# Merge crime and weather data by date
joined_data <- merge(crime_cases, temp_records, by.x = "date", by.y = "Date")

# View structure of combined data
summary(joined_data)
##       date              category         persistent_id           lat       
##  Min.   :2024-01-01   Length:6304        Length:6304        Min.   :51.88  
##  1st Qu.:2024-03-01   Class :character   Class :character   1st Qu.:51.89  
##  Median :2024-07-01   Mode  :character   Mode  :character   Median :51.89  
##  Mean   :2024-06-15                                         Mean   :51.89  
##  3rd Qu.:2024-09-01                                         3rd Qu.:51.89  
##  Max.   :2024-12-01                                         Max.   :51.90  
##                                                                            
##       long          street_id       street_name              id           
##  Min.   :0.8788   Min.   :2152686   Length:6304        Min.   :115954844  
##  1st Qu.:0.8966   1st Qu.:2153025   Class :character   1st Qu.:118009952  
##  Median :0.9013   Median :2153155   Mode  :character   Median :120228058  
##  Mean   :0.9029   Mean   :2153873                      Mean   :120403000  
##  3rd Qu.:0.9088   3rd Qu.:2153366                      3rd Qu.:122339060  
##  Max.   :0.9246   Max.   :2343256                      Max.   :125550731  
##                                                                           
##  location_type      outcome_status         month           season         
##  Length:6304        Length:6304        Min.   : 1.000   Length:6304       
##  Class :character   Class :character   1st Qu.: 3.000   Class :character  
##  Mode  :character   Mode  :character   Median : 7.000   Mode  :character  
##                                        Mean   : 6.481                     
##                                        3rd Qu.: 9.000                     
##                                        Max.   :12.000                     
##                                                                           
##    station_ID   TemperatureCAvg TemperatureCMax TemperatureCMin 
##  Min.   :3590   Min.   : 7.00   Min.   :10.60   Min.   : 2.500  
##  1st Qu.:3590   1st Qu.: 7.20   1st Qu.:10.90   1st Qu.: 5.400  
##  Median :3590   Median :11.50   Median :14.70   Median : 8.100  
##  Mean   :3590   Mean   :11.67   Mean   :15.36   Mean   : 8.278  
##  3rd Qu.:3590   3rd Qu.:14.50   3rd Qu.:19.30   3rd Qu.:11.700  
##  Max.   :3590   Max.   :19.90   Max.   :25.70   Max.   :15.000  
##                                                                 
##      TdAvgC           HrAvg        WindkmhDir          WindkmhInt   
##  Min.   : 3.600   Min.   :66.90   Length:6304        Min.   : 6.90  
##  1st Qu.: 5.300   1st Qu.:77.80   Class :character   1st Qu.:14.20  
##  Median : 9.700   Median :83.60   Mode  :character   Median :15.50  
##  Mean   : 8.927   Mean   :84.67                      Mean   :18.51  
##  3rd Qu.:12.200   3rd Qu.:92.30                      3rd Qu.:24.00  
##  Max.   :13.000   Max.   :95.70                      Max.   :28.90  
##                                                                     
##   WindkmhGust      PresslevHp         Precmm          TotClOct    
##  Min.   :20.40   Min.   : 990.2   Min.   : 0.000   Min.   :2.500  
##  1st Qu.:37.10   1st Qu.:1001.3   1st Qu.: 0.000   1st Qu.:6.500  
##  Median :38.90   Median :1014.2   Median : 0.400   Median :7.000  
##  Mean   :43.13   Mean   :1011.8   Mean   : 1.898   Mean   :6.577  
##  3rd Qu.:51.90   3rd Qu.:1021.0   3rd Qu.: 3.000   3rd Qu.:7.900  
##  Max.   :61.20   Max.   :1027.1   Max.   :11.000   Max.   :8.000  
##                                   NA's   :537                     
##     lowClOct         SunD1h           VisKm            Month     
##  Min.   :5.500   Min.   : 0.000   Min.   : 7.30   July    : 608  
##  1st Qu.:7.300   1st Qu.: 0.000   1st Qu.:13.90   May     : 568  
##  Median :7.700   Median : 0.600   Median :26.40   February: 546  
##  Mean   :7.433   Mean   : 2.493   Mean   :26.33   October : 537  
##  3rd Qu.:7.900   3rd Qu.: 5.000   3rd Qu.:42.50   August  : 533  
##  Max.   :8.000   Max.   :11.600   Max.   :48.20   January : 529  
##                                                   (Other) :2983

To analyze crime in relation to weather, the crime data and daily weather data were combined. Prior to merging, the date columns in both files were converted to a common format using R’s as.Date() function. Once the dates were aligned, the two tables were merged on the date field, producing a combined data frame of 6,304 observations containing both crime and weather variables. The merged frame holds weather variables such as temperature, precipitation (Precmm), relative humidity (HrAvg) and total cloud cover (TotClOct), alongside the category, location and outcome status of each crime. This opens up an interesting avenue of analysis: the weather on a given day can be linked directly to the criminal activity recorded that day, and the relationship can also be examined spatially.
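One design choice worth noting: merge() performs an inner join by default, silently dropping any crime record whose date has no weather reading. A toy illustration (made-up rows) contrasts this with dplyr's left_join(), which keeps every crime row and fills missing weather with NA:

```r
# Inner join (merge) vs left join on toy crime/weather tables
library(dplyr)

crimes  <- data.frame(date = as.Date(c("2024-01-01", "2024-01-02")),
                      category = c("theft", "assault"))
weather <- data.frame(Date = as.Date("2024-01-01"),
                      TemperatureCAvg = 7.0)

inner_joined <- merge(crimes, weather, by.x = "date", by.y = "Date")
left_joined  <- left_join(crimes, weather, by = c("date" = "Date"))
```

Here the inner join keeps one row while the left join keeps both, with NA for the missing temperature. Since the merged frame above retains all 6,304 crime records, every crime date evidently had a matching weather reading, but the distinction matters for less complete datasets.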

3.1 Do More Crimes Happen on Hotter or Colder Days?

# Bin average temperature into categories
joined_data$temp_group <- cut(joined_data$TemperatureCAvg,
                                breaks = c(-Inf, 5, 10, 15, 20, Inf),
                                labels = c("Very Cold", "Cold", "Mild", "Warm", "Hot"))

# Count crimes per temperature group
crime_temp_relation <- joined_data %>%
  group_by(temp_group) %>%
  summarise(total_crimes = n())

# Bar plot
ggplot(crime_temp_relation, aes(x = temp_group, y = total_crimes, fill = temp_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Crimes by Temperature Group",
       x = "Temperature Group", y = "Number of Crimes") +
  theme_minimal()

The bar chart of crime frequency in Figure 6 indicates that incidents peaked on “Warm” days, followed by “Hot” days, and that fewer crimes occurred on days classified as “Very Cold” or “Cold”. These findings align with criminological behavioral theories: warm weather encourages social interaction and outdoor leisure activities, creating opportunities for both interpersonal and opportunistic crime.
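A caveat, sketched here on toy data: raw counts can be misleading if some temperature groups simply contain more days than others. Dividing each group's crime count by its number of distinct days gives a crimes-per-day rate, a fairer basis for comparison:

```r
# Crimes-per-day rate per temperature group, on invented rows
library(dplyr)

toy <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-06-01")),
  temp_group = c("Cold", "Cold", "Warm")
)

rates <- toy %>%
  group_by(temp_group) %>%
  summarise(total_crimes = n(),
            days = n_distinct(date),
            crimes_per_day = total_crimes / days,
            .groups = "drop")
```

Applied to joined_data, the same summarise() would reveal whether “Warm” days genuinely have more crime per day or merely occur more often in the year.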

3.2 How Does Crime Change with Different Levels of Rain?

# Bin precipitation levels
joined_data$rain_group <- cut(joined_data$Precmm,
                                breaks = c(-Inf, 0, 2, 5, 10, Inf),
                                labels = c("No Rain", "Light", "Moderate", "Heavy", "Very Heavy"))

rain_crime_relation <- joined_data %>%
  group_by(rain_group) %>%
  summarise(total_crimes = n())

ggplot(rain_crime_relation, aes(x = rain_group, y = total_crimes, fill = rain_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Crime by Rainfall Intensity",
       x = "Rainfall Level", y = "Number of Crimes") +
  theme_minimal()

Rainfall was binned into five intensity levels: No Rain, Light (0–2mm), Moderate (2–5mm), Heavy (5–10mm), and Very Heavy (>10mm). A bar chart comparing these groups shows that dry days had the most crime instances, with counts falling as rainfall increased; days with very heavy rain had the fewest reported crimes. This inverse relationship between rainfall and crime suggests that bad weather keeps people indoors, reducing the opportunity for crimes such as street theft, vandalism, or assault.

3.3 How Temperature Relates to Solved Crimes

# Compare solved vs unsolved crimes by temperature
solved_crimes_by_temp <- joined_data %>%
  mutate(solved = ifelse(outcome_status == "Investigation complete; no suspect identified", "Unsolved", "Solved")) %>%
  group_by(solved) %>%
  summarise(avg_temp = mean(TemperatureCAvg, na.rm = TRUE))

ggplot(solved_crimes_by_temp, aes(x = solved, y = avg_temp, fill = solved)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Temperature for Solved vs Unsolved Crimes",
       x = "Crime Solved Status", y = "Average Temperature (°C)") +
  theme_minimal()

This analysis compared the average temperatures for solved vs. unsolved crimes. It revealed that:

  1. Solved crimes occurred at an average of 11.7°C

  2. Unsolved crimes occurred at a warmer average of 13.9°C

This small yet notable difference was illustrated with a bar chart. It implies that offenses committed on cooler days had a better likelihood of being solved, perhaps because conditions such as smaller crowds or clearer skies led to better witness accounts or forensic work.
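An illustrative check (using synthetic temperatures, not the report's records): a Welch two-sample t-test is one way to judge whether a gap like 11.7°C vs 13.9°C could plausibly arise by chance, given the day-to-day spread of temperatures:

```r
# Welch t-test on synthetic solved/unsolved temperature samples
set.seed(42)  # reproducible synthetic draw
solved_temp   <- rnorm(100, mean = 11.7, sd = 3)
unsolved_temp <- rnorm(100, mean = 13.9, sd = 3)
tt <- t.test(unsolved_temp, solved_temp)
```

On the real data, the two groups' TemperatureCAvg vectors would be passed to t.test() directly, and the p-value would indicate whether the observed gap is statistically meaningful.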

3.4 How Crime Types Change with Cold, Warm, Dry, or Wet Weather

# Define weather conditions
joined_data$weather_type <- case_when(
  joined_data$TemperatureCAvg < 10 & joined_data$Precmm > 1 ~ "Cold & Wet",
  joined_data$TemperatureCAvg < 10 & joined_data$Precmm <= 1 ~ "Cold & Dry",
  joined_data$TemperatureCAvg >= 10 & joined_data$Precmm > 1 ~ "Warm & Wet",
  TRUE ~ "Warm & Dry"
)

grouped_crime_weather_data <- joined_data %>%
  group_by(weather_type, category) %>%
  summarise(count = n(), .groups = "drop")

ggplot(grouped_crime_weather_data, aes(x = weather_type, y = count, fill = category)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Crime Types in Different Weather Conditions",
       x = "Weather Type", y = "Crime Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The results showed that “Warm & Dry” days were associated with the largest number of crimes, particularly anti-social behaviour, public disorder and violence, while burglaries and vehicle crime were slightly higher on “Cold & Wet” days. These trends can support weather-aware crime prevention, e.g. monitoring public spaces on sunny days and prioritising burglary prevention during colder months.
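A presentational refinement, sketched on invented counts: converting the stacked counts to within-group shares makes the crime mix comparable across weather types, since groups with more days naturally accumulate more crimes overall:

```r
# Within-group crime shares per weather type, on toy counts
library(dplyr)

counts <- data.frame(
  weather_type = c("Warm & Dry", "Warm & Dry", "Cold & Wet", "Cold & Wet"),
  category     = c("violence", "burglary", "violence", "burglary"),
  count        = c(30, 10, 5, 15)
)

shares <- counts %>%
  group_by(weather_type) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()
```

Plotting share rather than count (e.g. with position = "fill" in geom_bar) would show directly whether burglary forms a larger slice of crime on “Cold & Wet” days, independent of total volume.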

3.5 Interaction Between Temperature, Visibility, and Crime Type

# Select three-way relationship for a few top crime categories
crime_rankings <- joined_data %>%
  count(category, sort = TRUE) %>%
  top_n(5) %>%
  pull(category)
## Selecting by n
interaction_results <- joined_data %>%
  filter(category %in% crime_rankings)

ggplot(interaction_results, aes(x = TemperatureCAvg, y = VisKm, color = category)) +
  geom_point(alpha = 0.5) +
  labs(title = "Temperature vs Visibility by Crime Type",
       x = "Average Temperature (°C)", y = "Visibility (km)") +
  theme_minimal()

The visualization depicted:

  1. Anti-social behaviour and violent crime incidents clustered at moderate-to-high temperatures and visibility.

  2. Vehicle crime occurred across a wider range of visibility values, but clustered around mild temperatures.

3.6 Crime Groups Based on Combined Weather Conditions

library(dplyr)
library(ggplot2)
# Categorize using temperature and humidity
joined_data <- joined_data %>%
  mutate(climate_group = case_when(
    TemperatureCAvg >= 15 & HrAvg < 60 ~ "Hot & Dry",
    TemperatureCAvg >= 15 & HrAvg >= 60 ~ "Hot & Humid",
    TemperatureCAvg < 15 & HrAvg < 60 ~ "Cold & Dry",
    TRUE ~ "Cold & Humid"
  ))

# Summarize crime counts by climate group
climate_crime_summary <- joined_data %>%
  group_by(climate_group, category) %>%
  summarise(count = n(), .groups = "drop")

# Plot
ggplot(climate_crime_summary, aes(x = climate_group, y = count, fill = category)) +
  geom_bar(stat = "identity") +
  labs(title = "Crime by Combined Weather Conditions",
       x = "Climate Group", y = "Crime Count") +
  theme_minimal()

The bar chart indicated increased rates of violence and public disorder under “Hot & Humid” and “Hot & Dry” conditions, whereas burglaries and criminal damage were higher under “Cold & Humid” weather. This suggests that crime prevention practices should consider specific climate profiles rather than generalized weather classes.

3.7 Crime Trend vs Cloud Cover Over Time

# Group crime and cloud data by month
cloud_impact_on_crime <- joined_data %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(total_crimes = n(),
            avg_cloud = mean(TotClOct, na.rm = TRUE))

ggplot(cloud_impact_on_crime, aes(x = month)) +
  geom_line(aes(y = total_crimes, color = "Total Crimes"), size = 1.2) +
  geom_line(aes(y = avg_cloud * 10, color = "Cloud Cover (scaled)"), linetype = "dashed") +
  scale_y_continuous(name = "Total Crimes",
                     sec.axis = sec_axis(~./10, name = "Cloud Cover (Octas)")) +
  labs(title = "Crime Trend vs Cloud Cover Over Time",
       x = "Month", color = "") +
  theme_minimal()

The line chart presented a weak positive correlation between monthly cloud cover and crime: more crimes coincided with cloudier periods, particularly during the winter months. This may be tied to diminished natural light, which can lower public vigilance and provide more opportunities to commit theft. The trend further supports the concept that atmospheric conditions influence human behavior.
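The "weak positive correlation" read off the chart can be put on a numeric footing with cor() on the monthly series. A toy quantification with invented monthly figures (positive by construction):

```r
# Pearson correlation between toy monthly crime totals and cloud averages
monthly_crimes <- c(500, 520, 480, 610, 650)
monthly_cloud  <- c(6.0, 6.5, 5.8, 7.2, 7.5)
r <- cor(monthly_crimes, monthly_cloud)
```

On the real data, cor(cloud_impact_on_crime$total_crimes, cloud_impact_on_crime$avg_cloud) would give the exact coefficient, though with only twelve monthly points any estimate carries wide uncertainty.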

Conclusion

This report examined crime trends in Colchester during 2024, integrating over 6,300 crime records with detailed weather data. The analysis revealed clear spatial and seasonal patterns—crime rates were higher in central areas like High Street, East Hill, and Queen Street, and more frequent during warmer months.

Weather conditions played a notable role: warm, dry days saw increased incidents of anti-social behaviour and assault, while cold, rainy days correlated with lower crime. Crimes were also less likely to be solved on warmer days, possibly due to environmental factors like crowd density.

These findings suggest benefits for context-aware policing—such as increasing patrols during warm evenings—and urban planning strategies that factor in environmental conditions, like improved lighting and seasonal crime alerts.

Finally, further research incorporating variables like public holidays, events, and real-time weather could enhance crime forecasting and prevention strategies.

References

  1. Tierney, N. (n.d.). UK Police Crime Data (Crime24 dataset). Retrieved from https://ukpolice.njtierney.com/reference/ukp_crime.html

  2. Czernecki, B. (n.d.). Ogimet Meteorological Data (Temp24 dataset). Retrieved from https://bczernecki.github.io/climate/reference/meteo_ogimet.html